Chapter 8 Cluster Refinement
Not thrilled that so many of the bigger clusters get broken up even though they look nicely separated on the UMAP projection. It would also be good to handle the big center blob better. Two ideas for potential refinement:
- Run the clustering again on the centroids of the clusters.
- Focus on the blob and see if treating it separately helps (divide and conquer): with less information overall to squeeze into the viz, there may be room for more separation.
The first idea is easier, so we’ll start there:
8.1 Refinement Idea 1: Clustering the centroids
cen_clus = hdbscan(centroids, 3) # Down to 78 Clusters...Looks Pretty Good.
# Omit the 2 outside
fig <- plot_ly(type = 'scatter', mode = 'markers') %>%
  add_trace(x = centroids[,1],
            y = centroids[,2],
            text = ~paste('Key Words:', displayWords, "$<br>Cluster Number: ", cen_clus$cluster),
            color = factor(cen_clus$cluster),
            showlegend = FALSE)
fig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Now we just need a function that maps the new centroid clustering back to the original points. Thanks to R's subsetting, this is essentially one line of code (the final line of the function remapClusters below), with the minor complication that noise points are labeled 0, which is not a valid index. We therefore relabel the noise points as cluster number k+1 and append a (k+1)-th entry with value 0 to the vector of centroid labels, so that noise points remain noise after remapping.
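To make the subsetting trick concrete, here is a small toy illustration in base R. All labels are invented for the example (four original clusters that merge pairwise); the variable names `orig`, `cen`, `lookup`, and `idx` are hypothetical and do not appear elsewhere in this chapter.

```r
# Toy illustration of the remapping trick (all values are made up).
# Original clustering: 4 clusters (1..4), with 0 marking noise points.
orig <- c(1, 2, 0, 3, 4, 2, 0, 1)
k <- 4  # number of original clusters

# Centroid clustering: one label per original cluster.
# Here clusters 1 and 2 merge, and clusters 3 and 4 merge.
cen <- c(1, 1, 2, 2)

# Send noise points to a dummy slot k + 1 whose label is 0.
idx <- orig
idx[idx == 0] <- k + 1
lookup <- c(cen, 0)

# Indexing the lookup table by cluster number relabels every point at once.
new_labels <- lookup[idx]
print(new_labels)
```

Each point picks up the label of the centroid of its old cluster, and the two noise points land on the dummy slot and stay at 0.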
Additional thought (not implemented): leave the noise points in and cluster them together with the centroids. This would allow points that were previously labeled as noise to join a cluster of nearby centroids.
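A rough sketch of how the bookkeeping for this unimplemented idea might look, under some loud assumptions: the noise points are appended below the centroids in their original order (for example with rbind) before clustering, so rows 1..k of the combined input are centroids and the remaining rows are noise points, and the original clusters are numbered 1..k without gaps. The helper name `remapWithNoise` and its arguments are hypothetical; the clustering call itself is left out, and the labels below are invented.

```r
# Hypothetical helper: remap labels when noise points were clustered
# together with the centroids.
# orig_labels:     original label per point (0 = noise, clusters 1..k)
# combined_labels: labels from clustering the stacked input, where entries
#                  1..k belong to centroids and the rest to noise points
remapWithNoise <- function(orig_labels, combined_labels) {
  k <- max(orig_labels)               # assumes clusters are numbered 1..k
  is_noise <- orig_labels == 0
  stopifnot(length(combined_labels) == k + sum(is_noise))
  idx <- orig_labels
  # The j-th noise point (in original order) sits in row k + j
  idx[is_noise] <- k + seq_len(sum(is_noise))
  combined_labels[idx]
}

# Made-up labels: 3 original clusters and 2 noise points.
orig <- c(1, 2, 0, 3, 0, 2)
combined <- c(1, 1, 2, 1, 0)  # centroids 1..3, then the 2 noise points
remapped <- remapWithNoise(orig, combined)
print(remapped)
```

In this toy case the first noise point joins the merged cluster of centroids 1 and 2, while the second is still labeled noise (0) by the second clustering.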
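remapClusters = function(cen_clus, clus){
  # k = number of original clusters (one score per cluster)
  k = length(clus$cluster_scores)
  # Original labels; move noise points (0) to a dummy cluster k+1
  c = as.vector(clus$cluster)
  c[c==0] = k+1
  # Centroid labels, one per original cluster; slot k+1 maps noise back to 0
  cc = as.vector(cen_clus$cluster)
  cc[k+1] = 0
  # Look up each point's new label by its original cluster number
  new = cc[c]
  return(new)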
}
8.2 Grand Visualization of Refined Clusters
newclusters = remapClusters(cen_clus, clus)
newclusters = newclusters[index_subset]
fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = data_subset[,1],
    y = data_subset[,2],
    text = ~paste('Heading:', head_subset, "$<br>Text: ", raw_text_subset, "$<br>Cluster Number: ", clusters),
    hoverinfo = 'text',
    color = factor(newclusters),
    showlegend = FALSE
  )
fig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors